2: Extract spatial data based on spatial polygons

1 Introduction

This tutorial shows how to extract spatial control variables based on the surveys' administrative divisions. The survey refers to Ethiopia 2019 and comes from The World Bank Living Standards Measurement Study (LSMS). The spatial variables are nighttime light, agroecological zones, Urban-Rural Catchment Area, elevation, and climatic parameters. More information about these datasets is in the Appendix.


2 Code

2.1 Set Up

We start by setting the stage for our analysis. First, we load the necessary packages. We load only the climatic4economist package, which contains several functions meant to extract and merge spatial variables with surveys. During the tutorial we will use other packages, but instead of loading them all at the beginning we will call specific functions each time.

library(climatic4economist)

In the setup, we also want to create the paths to the various data sources and load the necessary functions for extraction. Note that .. means one step back in the folder hierarchy, i.e. the parent folder.

Note that how to set up the paths depends on your folder organization, but there are overall two approaches:

  1. you can use an R project: by opening the project directly you don’t need to set up the path manually, as the project figures out on its own where it is located on the computer and sets that path as the working folder.
  2. you can manually set the working folder with the function setwd().
# path to data folder
path_to_data <- file.path("..",
                          "..", "data")

# survey and administrative division
path_to_survey  <- file.path(path_to_data, "survey", "LSMS", "LSMS_ETH19.dta")
path_to_adm_div <- file.path(path_to_data, "adm_div", "geoBoundaries")

# weather variables
path_to_pre <- file.path(path_to_data, "weather", "ERA5_Land", "AFR", "monthly",
                         "afr_month_50_25_tpr.nc")
path_to_tmp <- file.path(path_to_data, "weather", "ERA5_Land", "AFR", "monthly",
                         "afr_month_50_25_tmp.nc")

# control variables
path_to_elevation  <- file.path(path_to_data, "spatial", "elevation", "GloFAS",
                               "elevation_glofas_v4_0.nc")
path_to_urca       <- file.path(path_to_data, "spatial", "URCA", 
                               "URCA.tif")
path_to_pop        <- file.path(path_to_data, "spatial", "population", "WorldPop",
                               "uncontraint_1km_global", "ppp_2019_1km_Aggregated.tif")
path_to_nightlight <- file.path(path_to_data, "spatial", "nighttime_light",
                                "VIIRS", "VNL_v21_npp_2019_global_vcmslcfg_c202205302300.average_masked.dat.tif")
path_to_aez        <- file.path(path_to_data, "spatial", "AgroEcological", "AEZ", 
                                "GAEZv5",  "GAEZ-V5.AEZ33-10km.tif")

# to result folder
path_to_result <- file.path(path_to_data, "result")
1
concatenate the string to make a path
2
.. means one folder back


2.2 Read the Data

2.2.1 Survey Data

We start by reading the survey data. The survey is stored as a .dta file, so we use the haven::read_dta() function to read it.

We only need the hhid, the survey coordinates, and the interview dates. We use dplyr::select() to choose these variables. This step is optional; skipping it would simply carry along all the variables, which we won’t use.

Then we create or modify some variables with the function dplyr::mutate(). We transform the variable interview_date from string into date, and we get the year of the median interview date. This step is important as it allows us to define the most appropriate year to select for the spatial variables.

srvy <- haven::read_dta(path_to_survey) |>
    dplyr::select(survey_year, hhid, country, lat, lon, interview_date) |>
    dplyr::mutate(
        interview_date = clock::date_parse(interview_date,
                                           format = "%Y-%m-%d"),
        survey_year    = clock::get_year(median(interview_date)),
        .before = hhid)
1
read dta type data
2
select relevant variables
3
transform string into date type
4
specify format type
5
find the median year of the interviews

2.2.2 Spatial Data

Finally, we load the spatial data. This data typically comes in the form of raster data. A raster represents a two-dimensional image as a rectangular matrix or grid of pixels. These are spatial rasters because they are georeferenced, meaning each pixel (or “cell” in GIS terms) represents a square region of geographic space. The value of each cell reflects a measurable property (either qualitative or quantitative) of that region.

The spatial data is usually stored as a .tif or .nc file. We can read both of them with the function terra::rast().

When we print the raster, we obtain several key details. The dimensions tell us how many cells the raster consists of and the number of layers; each layer corresponds to a particular period for which the observations were made. We also get the spatial resolution, which defines the size of each square region in geographic space, and the coordinate reference system (CRS), i.e. EPSG:4326.

Important

When working with multiple spatial datasets, you must ensure that they have the same coordinate reference system (CRS). This is important because it allows all the data to “spatially” talk to each other.
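As a quick sanity check, a sketch like the following can verify the CRS before combining rasters; it assumes the rasters pop and elevation below have already been read with terra::rast(), and uses terra’s same.crs() and project() functions.

```r
# Hypothetical sanity check: confirm two rasters share a CRS before use;
# if not, reproject one onto the other's coordinate reference system.
if (!terra::same.crs(pop, elevation)) {
  elevation <- terra::project(elevation, terra::crs(pop))
}
```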

pop <- terra::rast(path_to_pop) |>
  setNames("pop")
pop

nightlight <- terra::rast(path_to_nightlight) |>
  setNames("nightlight")
nightlight

elevation <- terra::rast(path_to_elevation)
elevation

urca <- terra::rast(path_to_urca)
urca

aez <- terra::rast(path_to_aez) |>
    setNames("aez") 
aez
1
read raster type data
2
change the name of the layer
class       : SpatRaster 
dimensions  : 18720, 43200, 1  (nrow, ncol, nlyr)
resolution  : 0.008333333, 0.008333333  (x, y)
extent      : -180.0012, 179.9987, -72.00042, 83.99958  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : ppp_2019_1km_Aggregated.tif 
name        : pop 
class       : SpatRaster 
dimensions  : 33601, 86401, 1  (nrow, ncol, nlyr)
resolution  : 0.004166667, 0.004166667  (x, y)
extent      : -180.0021, 180.0021, -65.00208, 75.00208  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : VNL_v21_npp_2019_global_vcmslcfg_c202205302300.average_masked.dat.tif 
name        : nightlight 
class       : SpatRaster 
dimensions  : 3000, 7200, 1  (nrow, ncol, nlyr)
resolution  : 0.05, 0.05  (x, y)
extent      : -180, 180, -60, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
source      : elevation_glofas_v4_0.nc 
varname     : elevation (Height above sea level) 
name        : elevation 
unit        :         m 
class       : SpatRaster 
dimensions  : 17235, 43200, 1  (nrow, ncol, nlyr)
resolution  : 0.008333333, 0.008333333  (x, y)
extent      : -180, 180, -60, 83.625  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : URCA.tif 
name        : URCA 
class       : SpatRaster 
dimensions  : 2160, 4320, 1  (nrow, ncol, nlyr)
resolution  : 0.08333333, 0.08333333  (x, y)
extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : GAEZ-V5.AEZ33-10km.tif 
name        : aez 

Now we also read the weather observations. The same consideration about the coordinate reference system (CRS) still applies. When we work with rasters that also have observations over time, it is important to check how and where the time and date information is stored. Sometimes it is stored in the metadata and you can access it using terra::time(); other times it is saved as the names of the layers and you can access it using names(). Sometimes, like in this case, the date information is stored in the names but encoded as seconds elapsed since 1970-01-01 00:00. To transform these values into readable dates we can use the function second_to_date().
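The function second_to_date() comes from the package; as a hedged sketch of the underlying idea (the helper name below is hypothetical, not the package’s code), converting seconds since the Unix epoch into dates is a one-liner in base R:

```r
# Hypothetical sketch of the seconds-to-date conversion: divide the number
# of seconds by 86400 (seconds per day) and offset from the Unix epoch.
second_to_date_sketch <- function(secs) {
  as.Date(as.numeric(secs) / 86400, origin = "1970-01-01")
}

second_to_date_sketch(-631152000)  # 1950-01-01, the first layer's date
```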

Warning

Note that rasters can store time information in different ways, so it may not always be possible to retrieve dates in this manner. A common alternative is for dates to be embedded in the layer names, in which case we wouldn’t need to rename the layers.

pre <- terra::rast(path_to_pre)
pre
names(pre) <- terra::names(pre) |> second_to_date()
pre

tmp <- terra::rast(path_to_tmp)
names(tmp) <- terra::names(tmp) |> second_to_date()
1
transform the layers name with second into dates
class       : SpatRaster 
dimensions  : 741, 811, 904  (nrow, ncol, nlyr)
resolution  : 0.1, 0.1  (x, y)
extent      : -26.05, 55.05, -36.05, 38.05  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
source      : afr_month_50_25_tpr.nc:tp 
varname     : tp (Total precipitation) 
names       : tp_va~52000, tp_va~73600, tp_va~54400, tp_va~76000, tp_va~84000, tp_va~05600, ... 
unit        :           m,           m,           m,           m,           m,           m, ... 
class       : SpatRaster 
dimensions  : 741, 811, 904  (nrow, ncol, nlyr)
resolution  : 0.1, 0.1  (x, y)
extent      : -26.05, 55.05, -36.05, 38.05  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
source      : afr_month_50_25_tpr.nc:tp 
varname     : tp (Total precipitation) 
names       : 1950-01-01, 1950-02-01, 1950-03-01, 1950-04-01, 1950-05-01, 1950-06-01, ... 
unit        :          m,          m,          m,          m,          m,          m, ... 

2.2.3 Administrative Boundaries

We now move on to reading the administrative divisions. We use the function read_geoBoundaries() to do so. This function looks for the spatial polygons for the iso and lvl provided.

As we have the coordinates, we don’t actually need the administrative divisions for the extraction. However, we will use them to reduce the coverage of the spatial variables and to make some plots.

The same consideration about the coordinate reference system (CRS) is still valid.

adm_div <- read_geoBoundaries(path_to_adm_div, iso = "ETH", lvl = 2)
adm_div
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 74, 4  (geometries, attributes)
 extent      : 33.00224, 47.95925, 3.400365, 14.84602  (xmin, xmax, ymin, ymax)
 coord. ref. : lon/lat WGS 84 (EPSG:4326) 
 names       : ID_adm_div   iso adm_div_1 adm_div_2
 type        :      <chr> <chr>     <chr>     <chr>
 values      :          1   ETH    Somali     Afder
                        2   ETH   Gambela    Agnuak
                        3   ETH     SNNPR     Alaba


2.3 Georeference the Surveys

As we’ve mentioned, the spatial data is georeferenced, so we need to ensure the same for the survey data. We use the spatial coordinates to assign the administrative division to each household. We must ensure that we can later associate the correct weather data with the right household; we do this by creating a merging variable called ID.

This is handled by the prepare_coord() function, which requires the coordinates’ variable names as input.


We can print the result to check the transformation. The new column, ID, is created by prepare_coord() and identifies each unique coordinate. This is used to merge the weather data with the household data.
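As a rough sketch of what prepare_coord() plausibly does (the function below is a hypothetical reimplementation, not the package’s code), it boils down to assigning one ID per unique coordinate pair:

```r
# Hypothetical sketch: give every unique (lon, lat) pair an ID, so that
# households sharing a location also share an ID.
prepare_coord_sketch <- function(df, lon_var, lat_var) {
  key   <- paste(df[[lon_var]], df[[lat_var]])
  df$ID <- as.character(match(key, unique(key)))
  df[, c("ID", setdiff(names(df), "ID"))]
}

toy <- data.frame(hhid = c("a", "b", "c"),
                  lon  = c(39.0, 39.0, 41.8),
                  lat  = c(3.61, 3.61, 4.01))
prepare_coord_sketch(toy, "lon", "lat")  # households a and b share ID "1"
```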

srvy_coord <- prepare_coord(srvy,
                            lon_var = lon,
                            lat_var = lat)
srvy_coord
# A tibble: 6,505 × 7
   ID    survey_year hhid               country    lat   lon interview_date
   <chr>       <int> <chr>              <chr>    <dbl> <dbl> <date>        
 1 1            2019 051103088801903002 Ethiopia  3.61  39.0 2019-08-28    
 2 1            2019 051103088801903012 Ethiopia  3.61  39.0 2019-08-28    
 3 1            2019 051103088801903022 Ethiopia  3.61  39.0 2019-08-28    
 4 1            2019 051103088801903032 Ethiopia  3.61  39.0 2019-08-29    
 5 1            2019 051103088801903042 Ethiopia  3.61  39.0 2019-08-29    
 6 1            2019 051103088801903052 Ethiopia  3.61  39.0 2019-08-28    
 7 1            2019 051103088801903062 Ethiopia  3.61  39.0 2019-08-28    
 8 1            2019 051103088801903072 Ethiopia  3.61  39.0 2019-08-28    
 9 1            2019 051103088801903082 Ethiopia  3.61  39.0 2019-08-28    
10 1            2019 051103088801903092 Ethiopia  3.61  39.0 2019-08-29    
# ℹ 6,495 more rows

Once we have the unique coordinates, we are ready to transform them into spatial points using the georef_coord() function. When performing this transformation, it’s crucial to set the correct CRS, which must match that of the weather data. The function also requires the coordinates’ variable names as input. Usually, WGS 84 is the default coordinate reference system for coordinates; in this case it matches the weather data’s CRS.

srvy_geo <- georef_coord(srvy_coord,
                         geom = c("lon", "lat"),
                         crs = "EPSG:4326")
srvy_geo
 class       : SpatVector 
 geometry    : points 
 dimensions  : 516, 1  (geometries, attributes)
 extent      : 33.43483, 47.30784, 3.609384, 14.47715  (xmin, xmax, ymin, ymax)
 coord. ref. : lon/lat WGS 84 (EPSG:4326) 
 names       :    ID
 type        : <chr>
 values      :     1
                   2
                   3


2.4 Merge administrative division and survey

We want to associate each survey location with the administrative divisions. We do it by looking up which administrative division each survey location falls in. We save this information for later use.

srvy_adm_div <- get_poly_attr_for_point(srvy_geo, adm_div)
srvy_adm_div
# A tibble: 516 × 7
   ID      lon   lat ID_adm_div iso   adm_div_1 adm_div_2
   <chr> <dbl> <dbl> <chr>      <chr> <chr>     <chr>    
 1 1      39.0  3.61 10         ETH   Oromia    Borena   
 2 2      41.8  4.01 37         ETH   Somali    Liben    
 3 3      41.9  4.44 1          ETH   Somali    Afder    
 4 4      41.5  4.73 37         ETH   Somali    Liben    
 5 5      36.0  4.74 56         ETH   SNNPR     South Omo
 6 6      38.1  4.85 10         ETH   Oromia    Borena   
 7 7      37.4  4.96 10         ETH   Oromia    Borena   
 8 8      40.7  5.11 37         ETH   Somali    Liben    
 9 9      41.9  5.15 1          ETH   Somali    Afder    
10 10     44.6  5.24 1          ETH   Somali    Afder    
# ℹ 506 more rows


2.5 Crop the spatial variables

The spatial variables we have just loaded have global coverage. It might be convenient to reduce the coverage to just the countries we are interested in. We can do this by using the crop_with_buffer() function and the administrative divisions. As the name suggests, this function allows us to specify a buffer around the vector data to increase the spatial extent and crop a larger portion. This is useful as some survey coordinates are at the edge of the administrative borders or, in some rare cases, just outside the borders as a consequence of the coordinate modification for location anonymization. Further, to compute some spatial indicators in one cell we need the surrounding cell values; if we crop exactly at the borders, the cells at the edge won’t have their surrounding cells.

The buffer argument of the function specifies the increase around the spatial extent. By default, it is in the same unit of measure of the data.

This is not a compulsory step, but it reduces the memory burden and allows for more meaningful plotting.
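The core of crop_with_buffer() can be sketched with plain terra calls (a hypothetical reimplementation, not the package’s code): SpatExtent objects support + to grow the extent on every side.

```r
# Hypothetical sketch: enlarge the vector's extent by `buffer` map units
# on every side, then crop the raster to that enlarged extent.
crop_with_buffer_sketch <- function(r, v, buffer = 1) {
  terra::crop(r, terra::ext(v) + buffer)
}
```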

pop_cntry <- crop_with_buffer(pop, adm_div, buffer = 1)

nghtlght_cntry <- crop_with_buffer(nightlight, adm_div, buffer = 1)

elevatn_cntry <- crop_with_buffer(elevation, adm_div, buffer = 1)

urca_cntry <- crop_with_buffer(urca, adm_div, buffer = 1)

aez_cntry <- crop_with_buffer(aez, adm_div, buffer = 1)

pre_cntry <- crop_with_buffer(pre, adm_div, buffer = 1)

tmp_cntry <- crop_with_buffer(tmp, adm_div, buffer = 1)


2.6 Plot

A good practice when working with spatial data is to plot it. This is the best way to verify that everything is working as expected.

First, we plot the survey coordinates to ensure they are correctly located within the country and to examine their spatial distribution.

terra::plot(adm_div, col = "grey", main = "District of Ethiopia and Survey Coordinates")
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)
1
plot raster
2
add survey locations

We confirm that the survey locations are within the country borders, which is great! We also observe that the spatial distribution of survey coordinates is neither random nor uniform; most are concentrated near the major cities and in the North.

Next, we plot the spatial variables to see how they overlap with the survey coordinates.

terra::plot(elevatn_cntry, main = "Elevation")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)
1
plot raster
2
add administrative borders
3
add survey locations

terra::plot(log(1+pop_cntry), main = "Log Population")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)

terra::plot(urca_cntry, main = "URCA")
terra::lines(adm_div, col = "black", lwd = 2)
terra::points(srvy_geo, col = "red", alpha = 1, cex = 0.6)

terra::plot(log(1+nghtlght_cntry), main = "Log Nighttime Light")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 1, cex = 0.6)

terra::plot(pre_cntry, "2024-03-01", col = terra::map.pal("water"),
            main = "Monthly precipitation at 2024-03 and survey location")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)

terra::plot(tmp_cntry, "2024-03-01", col = terra::map.pal("ryb"),
            main = "Monthly temperature at 2024-03 and survey location")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "black", alpha = 0.5, cex = 0.5)

Once again, the survey coordinates align with the precipitation data, which is great! We can also observe the different spatial resolutions, with precipitation having a lower one. The consequence is that some survey coordinates fall within the same cell.


2.7 Modify the Spatial Variables

2.7.1 Compute Terrain Indicators

Now we compute some terrain indicators based on elevation. The terrain indicators are:

  • TRI (Terrain Ruggedness Index) is the mean of the absolute differences between the value of a cell and its 8 surrounding cells.

  • Slope is the average difference between the value of a cell and its 8 surrounding cells.

  • Roughness is the difference between the maximum and the minimum value of a cell and its 8 surrounding cells.

terrain_cntry <-  terra::terrain(elevatn_cntry,
                                 v = c("slope", "TRI", "roughness"),
                                 neighbors = 8,
                                 unit = "degrees")

nghtlght_cntry$ln_nightlight <- log(1 + nghtlght_cntry)
1
the terrain indicators
2
how many neighboring cells, 8 (queen case) or 4 (rook case)

2.7.2 Weather Variable Transformation

The original units of measure of the weather data are meters for precipitation and Kelvin for temperature. These units are not very intuitive, therefore we convert them into millimeters and Celsius respectively.

# From meter to millimeters
pre_cntry_mm <- pre_cntry*1000

# From Kelvin to Celsius
tmp_cntry_c <- tmp_cntry - 273.15


2.7.3 Compute the Surface Area of the Administrative Divisions

We now compute the surface area of each administrative division. We will use it for computing the population density. We use the function dplyr::mutate() to add the variable area_km and the function terra::expanse() to compute the surface area.

adm_div_area <- adm_div |>
    dplyr::mutate(area_km = terra::expanse(adm_div, unit = "km"))


2.8 Extraction

2.8.1 Spatial Variables

We extract the spatial data based on the administrative divisions using the extract_by_poly() function. This function requires the raster with the spatial data, the administrative divisions, and the aggregation function as input. The aggregation function, fn_agg, defines how the cell values that fall within an administrative division are combined into a single value. Note that by default all the cell values are weighted by the area of the cell that falls within the division.
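The coverage-weighted aggregation described above can be illustrated with a toy base R example for the mean case (a sketch of the idea, not the package’s implementation):

```r
# Toy illustration: cell values weighted by the fraction of each cell
# falling inside the division (coverage weighting).
weighted_agg <- function(values, coverage_fraction) {
  weighted.mean(values, w = coverage_fraction, na.rm = TRUE)
}

# a cell fully inside (weight 1) and one half inside (weight 0.5)
weighted_agg(c(10, 20), c(1, 0.5))  # (10*1 + 20*0.5) / 1.5 = 13.33
```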

Contrary to the other spatial variables, for population we use adm_div_area to extract the values, as we need the surface area to calculate the population density.

Looking at the result, we first see the ID_adm_div column, which identifies the unique administrative divisions. The second to fourth columns are the additional information coming from the administrative division data. The last column contains the spatial observations aggregated at the administrative division level.

pop_adm <- extract_by_poly(pop_cntry, adm_div_area, fn_agg = "sum")
pop_adm
# A tibble: 74 × 6
   ID_adm_div iso   adm_div_1        adm_div_2  area_km      pop
   <chr>      <chr> <chr>            <chr>        <dbl>    <dbl>
 1 1          ETH   Somali           Afder       62044.  791202.
 2 2          ETH   Gambela          Agnuak      23507.  250289.
 3 3          ETH   SNNPR            Alaba         863.  326562.
 4 4          ETH   Oromia           Arsi        20974. 3761471.
 5 5          ETH   Beneshangul Gumu Asosa       14397.  513373.
 6 6          ETH   Amhara           Awi/Agew     8950. 1249906.
 7 7          ETH   Oromia           Bale        54644. 2011606.
 8 8          ETH   SNNPR            Basketo       419.   79602.
 9 9          ETH   SNNPR            Bench Maji  19085.  921810.
10 10         ETH   Oromia           Borena      52437. 1406311.
# ℹ 64 more rows
nghtlght_adm <- extract_by_poly(nghtlght_cntry, adm_div, fn_agg = "mean")

elevation_adm <- extract_by_poly(elevatn_cntry, adm_div, fn_agg = "mean")

terrain_adm <- extract_by_poly(terrain_cntry, adm_div, fn_agg = "mean")

urca_adm <- extract_by_poly(urca_cntry, adm_div, fn_agg = "modal")

aez_adm <- extract_by_poly(aez_cntry, adm_div, fn_agg = "modal")

aez_adm
# A tibble: 74 × 5
   ID_adm_div iso   adm_div_1        adm_div_2    aez
   <chr>      <chr> <chr>            <chr>      <dbl>
 1 1          ETH   Somali           Afder         26
 2 2          ETH   Gambela          Agnuak         2
 3 3          ETH   SNNPR            Alaba          5
 4 4          ETH   Oromia           Arsi           6
 5 5          ETH   Beneshangul Gumu Asosa          2
 6 6          ETH   Amhara           Awi/Agew       5
 7 7          ETH   Oromia           Bale           1
 8 8          ETH   SNNPR            Basketo       26
 9 9          ETH   SNNPR            Bench Maji    26
10 10         ETH   Oromia           Borena         1
# ℹ 64 more rows

2.8.2 Weather Variables

For the weather data, we use a different function for extracting the data, namely extract_cell_by_poly(). Contrary to the function extract_by_poly(), this one doesn’t aggregate the values within the polygons but extracts all the cell values within the division separately. This is important as we want to compute the long run climatic parameters for each cell and only later aggregate them.

Note

Extracting each cell is more computationally and memory demanding, especially with large countries and long time series, but it increases precision as the aggregation, and thus the loss of information, happens at the very last stage of the process.

Looking at the result, we first see the ID_adm_div column, which identifies the unique administrative divisions. The second and third columns are the coordinates of the cells. The fourth is the fraction of the cell that actually falls within the administrative division. The other columns contain the weather observations over time specific to each cell.

pre_cell <- extract_cell_by_poly(pre_cntry_mm, adm_div)

tmp_cell <- extract_cell_by_poly(tmp_cntry_c, adm_div)
tmp_cell
# A tibble: 11,845 × 908
   ID_adm_div x_cell y_cell coverage_fraction X1950_01_01 X1950_02_01
   <chr>       <dbl>  <dbl>             <dbl>       <dbl>       <dbl>
 1 1            41      6.7            0.0309        23.0        24.3
 2 1            41.1    6.7            0.398         23.8        25.1
 3 1            41.2    6.7            0.0640        23.9        25.2
 4 1            40.9    6.6            0.0118        22.0        23.3
 5 1            41      6.6            0.673         22.9        24.2
 6 1            41.1    6.6            1             23.8        25.1
 7 1            41.2    6.6            0.887         24.2        25.5
 8 1            41.3    6.6            0.430         24.6        25.8
 9 1            41.4    6.6            0.0915        24.6        25.8
10 1            41.8    6.6            0.103         26.9        28.0
# ℹ 11,835 more rows
# ℹ 902 more variables: X1950_03_01 <dbl>, X1950_04_01 <dbl>,
#   X1950_05_01 <dbl>, X1950_06_01 <dbl>, X1950_07_01 <dbl>, X1950_08_01 <dbl>,
#   X1950_09_01 <dbl>, X1950_10_01 <dbl>, X1950_11_01 <dbl>, X1950_12_01 <dbl>,
#   X1951_01_01 <dbl>, X1951_02_01 <dbl>, X1951_03_01 <dbl>, X1951_04_01 <dbl>,
#   X1951_05_01 <dbl>, X1951_06_01 <dbl>, X1951_07_01 <dbl>, X1951_08_01 <dbl>,
#   X1951_09_01 <dbl>, X1951_10_01 <dbl>, X1951_11_01 <dbl>, …


2.9 Compute Long Run Climatic Parameters

We want to describe the long run climatic conditions in each location. The rule of thumb is to use 30 years of weather observations to capture climatic features. Therefore, we select the 30 years before each survey.

Check the column names carrying the dates of observation and how they have changed compared to before.
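A hedged sketch of what select_by_dates() plausibly does (a hypothetical reimplementation with simplified, year-inclusive semantics): parse the year out of the X1989_01_01-style column names and keep only the columns in range.

```r
# Hypothetical sketch: keep the id columns plus the date-named columns
# whose year falls within [from, to] (year-inclusive, a simplification).
select_by_dates_sketch <- function(df, from, to, id_cols = 1:4) {
  yrs  <- suppressWarnings(as.integer(substr(names(df), 2, 5)))
  keep <- !is.na(yrs) & yrs >= as.integer(from) & yrs <= as.integer(to)
  df[, union(names(df)[id_cols], names(df)[keep]), drop = FALSE]
}

toy <- data.frame(ID = "1", X1988_12_01 = 1, X1989_01_01 = 2, X2019_01_01 = 3)
select_by_dates_sketch(toy, "1989", "2019", id_cols = 1)
```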

pre_cell_30yrs <- select_by_dates(pre_cell, from = "1989", to = "2019" )
tmp_cell_30yrs <- select_by_dates(tmp_cell, from = "1989", to = "2019")
tmp_cell_30yrs
# A tibble: 11,845 × 365
   ID_adm_div x_cell y_cell coverage_fraction X1989_01_01 X1989_02_01
   <chr>       <dbl>  <dbl>             <dbl>       <dbl>       <dbl>
 1 1            41      6.7            0.0309        24.3        24.2
 2 1            41.1    6.7            0.398         25.0        24.9
 3 1            41.2    6.7            0.0640        25.0        25.0
 4 1            40.9    6.6            0.0118        23.5        23.3
 5 1            41      6.6            0.673         24.1        24.1
 6 1            41.1    6.6            1             24.9        24.9
 7 1            41.2    6.6            0.887         25.3        25.3
 8 1            41.3    6.6            0.430         25.5        25.4
 9 1            41.4    6.6            0.0915        25.4        25.3
10 1            41.8    6.6            0.103         27.4        27.8
# ℹ 11,835 more rows
# ℹ 359 more variables: X1989_03_01 <dbl>, X1989_04_01 <dbl>,
#   X1989_05_01 <dbl>, X1989_06_01 <dbl>, X1989_07_01 <dbl>, X1989_08_01 <dbl>,
#   X1989_09_01 <dbl>, X1989_10_01 <dbl>, X1989_11_01 <dbl>, X1989_12_01 <dbl>,
#   X1990_01_01 <dbl>, X1990_02_01 <dbl>, X1990_03_01 <dbl>, X1990_04_01 <dbl>,
#   X1990_05_01 <dbl>, X1990_06_01 <dbl>, X1990_07_01 <dbl>, X1990_08_01 <dbl>,
#   X1990_09_01 <dbl>, X1990_10_01 <dbl>, X1990_11_01 <dbl>, …

Now we can compute the long run climatic parameters. We calculate the mean, the standard deviation, and the coefficient of variation. We collect all the parameters in a separate object, parameter. This object is a named list of functions, constructed with the structure name = function; the list() function puts them together. This step is not compulsory, but it allows us to perform the computation of multiple parameters in a tidy and efficient way. Otherwise, we could have directly added them inside calc_par().

The function calc_par() calculates the required parameters.

The results have a similar structure, with the first columns identifying the specific locations and the others containing the computed parameters. Note how we are still carrying along the coverage_fraction variable, as we will need it for aggregating the climatic parameters at the administrative division level.
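The named-list-of-functions pattern can be illustrated with a toy base R helper (hypothetical; not the package’s calc_par(), and cv_sketch below is an assumed definition of the coefficient of variation):

```r
# Toy illustration: apply each function in a named list to one cell's
# monthly values and prefix the resulting names (e.g. tmp_std, tmp_avg).
apply_pars <- function(x, pars, prefix) {
  out <- vapply(pars, function(f) f(x), numeric(1))
  setNames(out, paste(prefix, names(pars), sep = "_"))
}

cv_sketch <- function(x) sd(x) / mean(x)  # assumed definition
apply_pars(c(20, 22, 24),
           list(std = sd, avg = mean, coef_var = cv_sketch), "tmp")
```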

parameter <- list(std = sd, avg = mean, coef_var = cv)
parameter
$std
function (x, na.rm = FALSE) 
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))
<bytecode: 0x3de2d0fd0>
<environment: namespace:stats>

$avg
function (x, ...) 
UseMethod("mean")
<bytecode: 0x1114a8170>
<environment: namespace:base>

$coef_var
function(x, na_rm = TRUE) {
    avg <- mean(x, na.rm = na_rm)
    if (is.nan(avg)) return(NA_real_)
    if (avg == 0) return(NA_real_)  # Avoid division by zero
    sd(x, na.rm = na_rm) / mean(x, na.rm = na_rm)
}
<bytecode: 0x3de2d2b70>
<environment: namespace:climatic4economist>
pre_par_cell <- calc_par(pre_cell_30yrs, pars = parameter, prefix = "pre")
tmp_par_cell <- calc_par(tmp_cell_30yrs, pars = parameter, prefix = "tmp")

tmp_par_cell
# A tibble: 11,845 × 7
   ID_adm_div x_cell y_cell coverage_fraction tmp_std tmp_avg tmp_coef_var
   <chr>       <dbl>  <dbl>             <dbl>   <dbl>   <dbl>        <dbl>
 1 1            40.9   5.7             0.249     1.33    26.8       0.0497
 2 1            40.9   5.8             0.673     1.37    27.1       0.0505
 3 1            40.9   5.9             0.725     1.37    26.6       0.0516
 4 1            40.9   6               0.728     1.32    25.5       0.0518
 5 1            40.9   6.1             0.811     1.28    24.8       0.0517
 6 1            40.9   6.2             0.719     1.29    24.5       0.0525
 7 1            40.9   6.3             0.780     1.25    24.1       0.0520
 8 1            40.9   6.4             0.892     1.24    23.8       0.0521
 9 1            40.9   6.5             0.509     1.22    23.4       0.0520
10 1            40.9   6.6             0.0118    1.18    22.9       0.0515
# ℹ 11,835 more rows

We have computed the climatic parameters for each cell, but we still need to aggregate them at the administrative division level. The function agg_to_adm_div() can do it for us; be aware that the function aggregates using the weighted mean, where the weights are provided by the coverage_fraction variable.

In the results we lose the information on the specific cells and we are left only with the administrative division id, ID_adm_div, and a single value of the climatic parameters for each location.
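A toy base R sketch of this coverage-weighted aggregation per division (hypothetical; agg_to_adm_div() is described as a coverage-weighted mean):

```r
# Toy illustration: coverage-weighted mean of a cell-level parameter
# within each administrative division.
agg_sketch <- function(df, value_col) {
  vapply(split(df, df$ID_adm_div), function(d) {
    weighted.mean(d[[value_col]], w = d$coverage_fraction, na.rm = TRUE)
  }, numeric(1))
}

cells <- data.frame(ID_adm_div        = c("1", "1", "2"),
                    coverage_fraction = c(1, 0.5, 1),
                    tmp_avg           = c(24, 27, 20))
agg_sketch(cells, "tmp_avg")  # division 1: (24*1 + 27*0.5) / 1.5 = 25
```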

pre_par_adm <- agg_to_adm_div(pre_par_cell , match_col = "pre")
tmp_par_adm <- agg_to_adm_div(tmp_par_cell, match_col = "tmp")

tmp_par_adm
# A tibble: 74 × 4
   ID_adm_div tmp_std tmp_avg tmp_coef_var
   <chr>        <dbl>   <dbl>        <dbl>
 1 1             1.27    27.4       0.0464
 2 10            1.59    22.5       0.0715
 3 11            1.84    20.0       0.0920
 4 12            1.65    20.9       0.0789
 5 13            1.76    21.6       0.0808
 6 14            1.17    26.5       0.0441
 7 15            1.68    18.3       0.0919
 8 16            1.18    21.3       0.0558
 9 17            1.40    20.2       0.0697
10 18            2.11    20.4       0.103 
# ℹ 64 more rows


2.10 Merge with Survey

Now that we have everything, we can combine all the extracted data and then merge it with the survey. We start by combining the data into a unique dataset. To do so, we create a list with the function list(), where each element of the list is a different spatial variable, and then we combine the elements of the list with the function purrr::reduce(). This last function requires another function as input to drive the combination, and we choose merge_by_common(), which merges two datasets by their common variable names.

Why not use merge_by_common() directly? Because the function works with just two datasets and we have eight different spatial datasets. We could cumulatively merge the datasets one by one, or we can use purrr::reduce().
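The reduce pattern itself can be shown with base R’s Reduce() and merge(), which behave analogously to purrr::reduce() with merge_by_common():

```r
# Toy illustration of folding a two-table merge over a list of tables.
d1 <- data.frame(id = 1:2, x = c("a", "b"))
d2 <- data.frame(id = 1:2, y = c(10, 20))
d3 <- data.frame(id = 1:2, z = c(TRUE, FALSE))

merged <- Reduce(function(l, r) merge(l, r, by = "id"), list(d1, d2, d3))
merged  # one table with columns id, x, y, z
```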

Then we also compute the population density. We use the dplyr::mutate() function to add this new variable, and the argument .after to specify the position of the variable among the columns.

sptl_adm <- list(pop_adm,            # combine the data into a list
                 nghtlght_adm,
                 terrain_adm,
                 elevation_adm,
                 urca_adm,
                 aez_adm,
                 pre_par_adm,
                 tmp_par_adm) |>
    purrr::reduce(merge_by_common) |>                          # merge all the elements of the list
    dplyr::mutate(pop_density = pop/area_km, .after = pop)     # create new variables

sptl_adm
# A tibble: 74 × 21
   ID_adm_div iso   adm_div_1    adm_div_2 area_km    pop pop_density nightlight
   <chr>      <chr> <chr>        <chr>       <dbl>  <dbl>       <dbl>      <dbl>
 1 1          ETH   Somali       Afder      62044. 7.91e5        12.8   0.000139
 2 2          ETH   Gambela      Agnuak     23507. 2.50e5        10.6   0.00118 
 3 3          ETH   SNNPR        Alaba        863. 3.27e5       378.    0.0387  
 4 4          ETH   Oromia       Arsi       20974. 3.76e6       179.    0.0108  
 5 5          ETH   Beneshangul… Asosa      14397. 5.13e5        35.7   0.00389 
 6 6          ETH   Amhara       Awi/Agew    8950. 1.25e6       140.    0.00539 
 7 7          ETH   Oromia       Bale       54644. 2.01e6        36.8   0.00182 
 8 8          ETH   SNNPR        Basketo      419. 7.96e4       190.    0       
 9 9          ETH   SNNPR        Bench Ma…  19085. 9.22e5        48.3   0.00289 
10 10         ETH   Oromia       Borena     52437. 1.41e6        26.8   0.00143 
# ℹ 64 more rows
# ℹ 13 more variables: ln_nightlight <dbl>, slope <dbl>, TRI <dbl>,
#   roughness <dbl>, elevation <dbl>, URCA <dbl>, aez <dbl>, pre_std <dbl>,
#   pre_avg <dbl>, pre_coef_var <dbl>, tmp_std <dbl>, tmp_avg <dbl>,
#   tmp_coef_var <dbl>

Now that we have all the control variables together, we can merge them with the survey information. The function merge_by_common() will do it for us.

However, the surveys do not carry information on the administrative divisions we have used, therefore we need an additional step to provide this information. We calculated this link information before and saved it as srvy_adm_div.

We first merge the link information with the extracted spatial variables; the output is then merged with the survey. Note that the pipe command |> assumes that the left-hand side is the first argument of the function; as that is not the case for us, we need to specify it with y = _, where y is the name of the argument and _ refers to the result of the previous merge.
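The placeholder behaviour of the native pipe can be shown in isolation (this requires R >= 4.2; the toy function f below is hypothetical):

```r
f <- function(x, y) paste(x, y)

# by default |> passes the left-hand side as the FIRST argument
"a" |> f("b")              # same as f("a", "b") -> "a b"

# with the _ placeholder (it must be a named argument),
# the left-hand side can be directed elsewhere
"b" |> f(x = "a", y = _)   # same as f(x = "a", y = "b") -> "a b"
```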

We can see that the result has all the information we retained from the surveys, the information about the administrative divisions, and the newly extracted spatial variables.

srvy_sptl_adm <- merge_by_common(srvy_adm_div, sptl_adm) |>   # merge adm info with spatial var
    merge_by_common(srvy_coord, y = _)                        # _ refers to the output of the previous merge
srvy_sptl_adm
# A tibble: 6,506 × 28
   ID    survey_year hhid    country   lat   lon interview_date ID_adm_div iso  
   <chr>       <int> <chr>   <chr>   <dbl> <dbl> <date>         <chr>      <chr>
 1 1            2019 051103… Ethiop…  3.61  39.0 2019-08-28     10         ETH  
 2 1            2019 051103… Ethiop…  3.61  39.0 2019-08-28     10         ETH  
 3 1            2019 051103… Ethiop…  3.61  39.0 2019-08-28     10         ETH  
 4 1            2019 051103… Ethiop…  3.61  39.0 2019-08-29     10         ETH  
 5 1            2019 051103… Ethiop…  3.61  39.0 2019-08-29     10         ETH  
 6 1            2019 051103… Ethiop…  3.61  39.0 2019-08-28     10         ETH  
 7 1            2019 051103… Ethiop…  3.61  39.0 2019-08-28     10         ETH  
 8 1            2019 051103… Ethiop…  3.61  39.0 2019-08-28     10         ETH  
 9 1            2019 051103… Ethiop…  3.61  39.0 2019-08-28     10         ETH  
10 1            2019 051103… Ethiop…  3.61  39.0 2019-08-29     10         ETH  
# ℹ 6,496 more rows
# ℹ 19 more variables: adm_div_1 <chr>, adm_div_2 <chr>, area_km <dbl>,
#   pop <dbl>, pop_density <dbl>, nightlight <dbl>, ln_nightlight <dbl>,
#   slope <dbl>, TRI <dbl>, roughness <dbl>, elevation <dbl>, URCA <dbl>,
#   aez <dbl>, pre_std <dbl>, pre_avg <dbl>, pre_coef_var <dbl>, tmp_std <dbl>,
#   tmp_avg <dbl>, tmp_coef_var <dbl>


2.11 Write

Here we are at the end; let's save the results. We want to save the result as a .dta file, so we use the haven::write_dta() function.

haven::write_dta(srvy_sptl_adm,
                 file.path(path_to_result, "ETH_sp_adm.dta"))



3 Take home messages

  • When working with multiple spatial data:

    1. remember to check the Coordinate Reference System of all datasets
    2. plot the data to check everything is going well
  • depending on the type of data, we use different functions

    • reading

      • for dta files use haven::read_dta() and haven::write_dta()
      • for spatial vectors use terra::vect(), or read_adm_div() for the administrative divisions of a specific country and level.
      • for spatial raster use terra::rast()
    • extraction

      • spatial points, use extract_by_coord()
      • spatial polygons, use extract_by_poly()
      • cells within polygons, use extract_cell_by_poly()
  • When working with raster data

    1. check the unit of measure
    2. if it is a time series, check also the date format
  • When working with spatial polygons, like administrative divisions, evaluate whether you want to extract the values already aggregated or each cell separately

    • for example, the terrain indicators were computed for each cell and then we moved to the extraction
    • for the climatic parameters, we extracted each cell separately, computed the parameters for each cell, and only later aggregated them.
  • When georeferencing the survey locations we take advantage of the fact that many interviews share the same location. Hence, we extract the variables just for these unique locations. However, some of these unique locations may fall within the same cells, so the amount of independent information might be even lower.
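As a reference for the climatic parameters reported earlier: they follow the standard definitions, with the coefficient of variation being the standard deviation divided by the mean (e.g. in the first row of tmp_par_adm, 1.27 / 27.4 ≈ 0.0464). A sketch on a hypothetical monthly series:

```r
# hypothetical monthly temperatures for a single cell
tmp <- c(20.1, 21.3, 22.8, 23.5, 22.0, 20.9)

tmp_avg      <- mean(tmp)          # long-run average
tmp_std      <- sd(tmp)            # variability around the average
tmp_coef_var <- tmp_std / tmp_avg  # coefficient of variation (unitless)
```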



4 Appendix

4.1 Want to know about the data?

4.1.1 Weather

Weather observations are obtained from the ERA5-Land reanalysis dataset. H-TESSEL is the land surface model at the basis of ERA5-Land. The data is a post-processed monthly-mean average of the original ERA5-Land dataset.

Parameter Value
spatial resolution 0.1° x 0.1° lon lat
temporal resolution month
time frame Jan. 1950 - Dec. 2022
unit of measure meter or Kelvin
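Because ERA5-Land stores total precipitation in metres and temperature in Kelvin, a common first step is converting to millimetres and degrees Celsius (the example values below are hypothetical):

```r
pre_m <- 0.012    # monthly precipitation, in metres
tmp_k <- 300.15   # temperature, in Kelvin

pre_mm <- pre_m * 1000     # metres -> millimetres (~12 mm)
tmp_c  <- tmp_k - 273.15   # Kelvin -> degrees Celsius (~27 °C)
```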

Suggested citation:

  • Muñoz Sabater, J. (2019): ERA5-Land monthly averaged data from 1950 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). DOI: 10.24381/cds.68d2bb30

It is possible to find additional information:

The data can be freely downloaded from:

4.1.1.0.1 Total precipitation

Accumulated liquid and frozen water, including rain and snow, that falls to the Earth’s surface. It is the sum of large-scale precipitation and convective precipitation. Precipitation variables do not include fog, dew or the precipitation that evaporates in the atmosphere before it lands at the surface of the Earth.

4.1.1.0.2 2 metre above ground temperature

Temperature of air at 2m above the surface of land, sea or in-land waters. 2m temperature is calculated by interpolating between the lowest model level and the Earth’s surface, taking account of the atmospheric conditions.


4.1.2 Spatial variables

4.1.2.1 Agro Ecological Zones

The Agro-ecological Zones classification (33 classes) provides a characterization of bio-physical resources relevant to agricultural production systems. AEZ definitions and map classes follow a rigorous methodology and an explicit set of principles. The inventory combines spatial layers of thermal and moisture regimes with broad categories of soil/terrain qualities. It also indicates locations of areas with irrigated soils and shows land with severely limiting bio-physical constraints including very cold and very dry (desert) areas as well as areas with very steep terrain or very poor soil/terrain conditions. The AEZ classification dataset is part of the GAEZ v5 Land and Water Resources theme and Agro-ecological Zones sub-theme. All results are derived from the Agro-ecological Zones (AEZ) modeling framework, developed collaboratively by the Food and Agriculture Organization (FAO) and the International Institute for Applied Systems Analysis (IIASA).

Parameter Value
spatial resolution 10 km.
temporal resolution 20 years
time frame 2001–2020
unit of measure classification by climate/soil/terrain/LC (33 classes)

Suggested citation:

  • FAO & IIASA. 2025. Global Agro-ecological Zoning version 5 (GAEZ v5) Model Documentation. https://github.com/un-fao/gaezv5/wiki

It is possible to find additional information:

The data can be freely downloaded from:


4.1.2.2 Urban-Rural Catchment Area (URCA)

Urban–rural catchment areas showing the catchment areas around cities and towns of different sizes (the no data value is 128). Each rural pixel is assigned to one defined travel time category to one of seven urban agglomeration sizes.

Parameter Value
spatial resolution 0.03° x 0.03° lon lat
temporal resolution year
time frame 2015
unit of measure travel time category to different urban hierarchy

Suggested citation:

  • Cattaneo, Andrea; Nelson, Andy; McMenomy, Theresa (2020). Urban-rural continuum. figshare. Dataset. https://doi.org/10.6084/m9.figshare.12579572.v4

It is possible to find additional information:

The data can be freely downloaded from:


4.1.2.3 Population

The units are number of people per pixel. The mapping approach is Random Forest-based dasymetric redistribution.

Parameter Value
spatial resolution 30 arc second (~1km)
temporal resolution year
time frame 2010 - 2020
unit of measure estimated count of people per grid-cell

Suggested citation:

  • WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00647

It is possible to find additional information from:

The data can be freely downloaded from:


4.1.2.4 Nighttime light

VIIRS nighttime lights (VNL) version V2.1: annual values obtained from the monthly averages with filtering to remove extraneous features such as biomass burning, aurora, and background.

Parameter Value
spatial resolution 15 arc second
temporal resolution year
time frame 2012 - 2021
unit of measure nW/cm2/sr, average-masked

Suggested citation:

  • Elvidge, C.D, Zhizhin, M., Ghosh T., Hsu FC, Taneja J. Annual time series of global VIIRS nighttime lights derived from monthly averages:2012 to 2019. Remote Sensing 2021, 13(5), p.922, doi:10.3390/rs13050922

It is possible to find additional information:

The data can be freely downloaded from:


4.1.2.5 Elevation

The Global Flood Awareness System (GloFAS) is one component of the Copernicus Emergency Management Service (CEMS). It is designed to support preparatory measures for flood events worldwide, particularly in large transnational river basins.

Elevation is obtained from the auxiliary variables of GloFAS. Each pixel is the mean height elevation above sea level.

Parameter Value
spatial resolution 0.03° x 0.03° lon lat
temporal resolution 30 years
time frame 1981 - 2010
unit of measure Meter (m)

Web resources:

Data access:


4.1.3 Survey

The Living Standards Measurement Study - Integrated Surveys on Agriculture (LSMS-ISA) is a unique system of longitudinal surveys designed to improve the understanding of household and individual welfare, livelihoods and smallholder agriculture in Africa. The LSMS team works with national statistics offices to design and implement household surveys with a strong focus on agriculture.

Suggested citation:

  • Central Statistics Agency of Ethiopia. (2020). Socioeconomic Survey 2018-2019 [Data set]. World Bank, Development Data Group. https://doi.org/10.48529/K739-C548

It is possible to find additional information:

The data can be freely downloaded from:


4.1.4 Administrative boundaries

The administrative divisions are obtained from geoBoundaries. Built by the community and the William & Mary geoLab, the geoBoundaries Global Database of Political Administrative Boundaries is an online, open-license (CC BY 4.0) resource of information on administrative boundaries (i.e., state, county) for every country in the world. Since 2016, the project has tracked approximately 1 million boundaries within over 200 entities, including all UN member states.

Suggested citation:

  • Runfola D, Anderson A, Baier H, Crittenden M, Dowker E, Fuhrig S, et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866.

It is possible to find additional information:

The data can be freely downloaded from: